Four-copy number alteration (CNA)-related lncRNA prognostic signature for liver cancer

The objective of this study was to identify CNA-related lncRNAs that can better evaluate the prognosis of patients with liver cancer. Prognostic molecular subtypes were identified, followed by tumor mutation and differential expression analyses. Genomic copy number anomalies and their association with lncRNAs were also evaluated. A risk model was built based on lncRNAs, as well as a nomogram, and the differences in the tumor immune microenvironment and drug sensitivity between the High_ and Low_risk groups were compared. Weighted gene co-expression network analysis was used to identify modules with significant enrichment in prognostic-related lncRNAs. In total, two subtypes were identified, TP53 and CTNNB1 were common high-frequency mutated genes in the two subtypes. A total of 8,372 differentially expressed (DE) mRNAs and 798 DElncRNAs were identified between cluster1 and cluster2. In addition, a four-lncRNA signature was constructed, and statistically significant differences between the Low_ and High_risk groups were found in terms of CD8 T cells, resting memory CD4 T cells, etc. Enrichment analysis showed that prognostic-related lncRNAs were involved in the cell cycle, p53 signaling pathway, non-alcoholic fatty liver disease, etc. A prognostic prediction signature, based on four-CNA-related lncRNAs, could contribute to a more accurate prognosis of patients with liver cancer.


Results
Identification of prognostic molecular subtype. A total of 6060 mRNAs, 3966 methylation genes, and 4961 CNA regions with significant prognostic associations were obtained. In addition, two subtypes, including cluster1 (n = 101) and cluster2 (n = 234), were identified using iClusterPlus (Table 1). Cluster2 had the most favorable prognosis ( Fig. 2A). PCA results showed the mRNA expression pattern and methylation pattern in the two subtypes were different (Fig. 2B,C). Moreover, based on the methylation level values of prognostic related methylated genes in each sample, the hierarchical clustering analysis was conducted. Hierarchical clustering analysis results revealed that the samples were divided into three groups, and total 116, 218, one samples were included in 1, 2, 3 groups, respectively, and the two identified clusters were tending to cluster together (Fig. 2D). Moreover, to further observe the tendentiousness of the identified clusters contained in each group in the hierarchical clustering, the sample distribution proportion diagram of each cluster in group 1 and 2 was drawn (Fig. 2E), the results shown that group 2 was mainly included cluster2, and group 1 was mainly contained cluster1, and the methylation pattern in the two groups were found different using chi square test (P < 0.05). In addition, the frequency of 102 mutated genes showed significant differences between two subtypes ( Fig. 3A; Table 2), and there were more mutated genes in cluster1. Among the 102 mutated genes, the top10 www.nature.com/scientificreports/ high-frequency mutated genes in subtypes are shown in Fig. 3B,C, and TP53 and CTNNB1 were common highfrequency mutated genes in the two subtypes.
LncRNAs abnormal expressions related to CNAs. The variant frequency of lncRNAs in the samples was calculated to evaluate the association between CNAs and lncRNA expression. The frequency of copy number gains and losses of lncRNAs on each chromosome varied (Fig. 4A); for example, numerous copies of chromosomes 4,8,9, and 17 were deficient, whereas there were a greater number of copies of chromosomes 5, 6, and 7. In addition, based on the expression profile and CNA profile of 1,358 lncRNAs, the correlation distribution between the copy number and lncRNA expression profile showed an overall trend of positive association (Fig. 4B). Numerous regions with lncRNA copy number gain and loss were revealed (Fig. 4C), indicating that the abnormal lncRNAs copy number might be associated with the progression of liver cancer. In addition, as shown in the heatmap (Fig. 4D), the variant ratio of lncRNAs in cluster1 increased compared to that in cluster2, and the Chi square test results shown that among the 1,358 lncRNAs, total 1,238 lncRNAs had significant difference on CNA between cluster1 and cluster2 (Supplementary Table 1). Next, a total of 52 lncRNAs with CNA frequency > 75% in samples were screened, and the differences in the expression of these 52 lncRNAs with copy number gain, copy number loss, and normal copy number were evaluated using Kruskal-Wallis test. The results    Establishment of a lncRNA signature. A total of 34 lncRNAs were screened as candidate lncRNAs, as described in the Methods section. Univariate Cox regression analysis was conducted, and a total of 12 prognostic-related lncRNAs were identified (Table 3). LASSO Cox regression analysis was performed, and four lncRNAs were utilized to build the signature (Fig. 5A), containing LOC339803, F11-AS1, PCAT2, and TMEM220-AS1.  Construction of the nomogram. After the univariate Cox regression analysis was carried out, Risk-Group, AJCC_PATHOLOGIC_TUMOR_STAGE, NEW_TUMOR_EVENT_AFTER_INITIAL_TREATMENT were identified with P < 0.05 (Fig. 7A), and these characteristics were used to build a nomogram (Fig. 7B). The calibration curves were matched to actual 1-, 3-, and 5-year survival (Fig. 7C).

Clinical characteristics. The distribution of each clinical characteristic in the High_ and Low_risk groups
was statistically explored, and the results showed significant differences in Pathologic-T, Pathologic-stage, Grade, Vascular invasion between the Low_ and High_risk groups (Table 4). In addition, the High_risk group presented more cluster1 samples ( Supplementary Fig. 3), which might explain the poor prognosis of patients in the High_risk group. Fig. 8A, ten immune cells, including memory B cells, regulatory T cells (Tregs), and M0 macrophages, showed obvious differences between the Low_ and High_risk groups. There were also statistically significant differences in immune score, estimate score, and tumor purity between the Low_ and High_risk groups (Fig. 8B).

Tumor immune microenvironment. As shown in
Drug sensitivity prediction. The IC50 of 138 drugs was quantified, and the differences between High_ and Low_risk groups were compared. In the case of 30 of these drugs (including Erlotinib, Lapatinib, and Gefitinib), a significant difference in IC50 was found between the two groups (Supplementary Fig. 4; Table 5), suggesting that the High_risk group may be more resistant to these drugs.
Identification of enriched lncRNA modules. The WGCNA package was employed to build a scale-free co-expression network, and the soft threshold power for matrix transformation was analyzed with the square of the related coefficient between log2k and log2p (k) being 0.85, and the power = 10 (Fig. 9A). For each module, the minimum number of genes was set to 30, and the similarity was greater than 0.1. These modules were clustered, and the modules with correlation coefficients greater than 0.8 were merged, yielding a total of seven modules (Fig. 9B). The two lncRNAs were clustered into a gray module, and the other two lncRNAs, TMEM220-AS1 (blue module) and F11-AS1 (brown module), were further analyzed. The enrichment analysis showed that the blue module was enriched in 487 GO-BP terms and 19 KEGG pathways (including cell cycle, p53 signaling pathway, DNA replication), and the brown module was enriched in 168 GO-BP terms and 15 KEGG pathways (including non-alcoholic fatty liver disease, fatty acid degradation, etc.) (Fig. 9C,D).

Discussion
CNAs have important functions in tumor progression 21 . In the present study, iClusterPlus was utilized for cluster analysis based on mRNA expression, methylation, CNA data, and iClusterplus, which can decrease the dimension of a dataset without altering the sample size. The results showed that two subtypes, cluster1 and cluster2, were identified. Cluster2 had the most favorable prognosis, and the CNA frequency of lncRNAs in cluster1 was higher than that in cluster2, which suggests that CNA-related lncRNAs were correlated with the prognosis of patients with liver cancer. Moreover, TP53 and CTNNB1 were common high-frequency mutated genes in both subtypes. In cancer, TP53 is the most frequently mutated gene, and more than 50% of human tumors carry TP53 gene mutations, including liver cancer [22][23][24] . Mutations in CTNNB1 have been implicated in the pathogenesis of  25 . These results indicate that subtype classification might help evaluate the prognosis of patients with liver cancer and have specific regulatory relationships at the level of transcription, genome, and epigenome.   26 , however, prognostic biomarkers of four lncRNAs associated with CNA were screened after LASSO Cox regression analysis in this study, containing LOC339803, F11-AS1, PCAT2, and TMEM220-AS1, and these four lncRNAs were further used to build the CNA-related lncRNA prognostic model for liver cancer. The patients in High_risk group had a poorer prognosis than patients in Low_risk group in all sets. The AUCs at 1-, 3-, and 5-year survival times in all sets were all approximately 0.7, suggesting that the performance of the lncRNA signature was reliable. In addition, patients with high expression of F11-AS1 and TMEM220-AS1 had a favorable prognosis, whereas high expression of LOC339803 and PCAT2 was associated with poor prognosis. Du et al. found that lncRNA F11-AS1 regulates PTEN expression by competitive binding with miR-3146 and inhibits the progression of liver hepatocellular carcinoma, and F11-AS1 may be used as a therapeutic target for liver hepatocellular carcinoma 27 30 . Our results are consistent with those reported above. However, few studies have reported PCAT2 expression in liver cancer. In addition, enrichment analysis showed that TMEM220-AS1 is involved in the cell cycle, DNA replication pathways, p53 signaling pathway, etc., and F11-AS1 is involved in non-alcoholic fatty liver disease, fatty acid degradation pathways, etc. The cell cycle is a complex process that is regulated by a variety of proteins at multiple levels, and the cell cycle pathway plays a crucial role in tumorigenesis 31 . Studies have reported that the p53 signaling pathway plays a vital role in the regulation of tumor progression [32][33][34] .
DNA replication is a basic biological process, in this process, disorder can lead to genomic instability, which is a hallmark of cancer 35 . Non-alcoholic fatty liver disease can develop into cirrhosis via fibrosis and can be www.nature.com/scientificreports/ complicated by hepatocellular carcinoma 36,37 . As fatty acids are essential for cancer cell proliferation, fatty acid degradation could provide a therapeutic strategy 38 . Thus, LOC339803, F11-AS1, PCAT2, and TMEM220-AS1 might have vital functions in the pathogenesis of liver cancer and could be used as prognostic markers for the cancer.
Because changes in the immune microenvironment have a profound effect on the progression of liver cancer 39 , the immune microenvironment changes were analyzed using the ESTIMATE algorithm and CIBERSORT. The results revealed that ten immune cells, and the immune, estimate scores, and tumor purity had obvious differences between Low-and High_risk groups. It has been reported that in hepatocellular carcinoma, regulatory T cells (Tregs) and exhausted CD8 T cells are increased and may clonally expand 40 . In patients with liver cancer, tumor-associated macrophages (TAMs) are regularly increased through immunohistochemical staining 41 . Rohr-Udilova et al. found that resting mast cells in hepatocellular carcinoma were increased when compared to healthy livers 42 . In addition, high immune and estimate scores are correlated with clinicopathological characteristics and poor prognosis in cancer 43 . In addition, statistically significant differences in IC50 of 30 drugs were found when Low-and High_risk groups were compared, among these drugs were Erlotinib, Lapatinib, Gefitinib, etc. These results revealed that these CNA-related lncRNA signatures might better predict the survival of patients with liver cancer, and these ten immune cells are related to the progression of liver cancer.
However, this study had some limitations. First, the data analyzed were downloaded from public databases, and external validation was required to show the utility of lncRNAs related signatures. Second, the fold-change was not calculated in the drug sensitivity prediction anlaysis dut to no quantized value that can represent resistant or sensitive, and further research should be conducted. Besides, the lncRNAs from which short peptide are transcribed should be considered in this study. In addition, the immune cell proportion was estimated using only the CIBERSORT algorithm, further relevant experiments should be carried out to verify this. Moreover, that's would be better if there were more relevant experiments to validate the biomarkers and pathways identified in this study.

Conclusion
In summary, a CNA-related lncRNA prognostic signature, which is closely correlated with the immune microenvironment, was constructed in this study. This signature is likely to improve the accuracy of liver cancer prognosis and provide insights into predictive biomarkers or potential targets for patients with liver cancer.

Materials and methods
Data collection and processing. The gene expression RNA-seq (log2(fpkm + 1)) data of GDC TCGA LIHC were downloaded from the UCSC Xene platform (https:// xenab rowser. net/) 44 , and the genes with expression less than 1 in more than half of the samples were filtered out. Those genes with "protein_coding" annotation (based on the downloaded gene annotation file in GENCODE V22 version) were reserved as mRNA, and the genes with "antisense, " "sense_intronic, " and "lincRNA", etc., annotation information were reserved as lncRNA. In addition, the CNA (cna_hg19.seg; https:// cbiop ortal-datah ub. s3. amazo naws. com/ lihc_ tcga. tar. gz; Affymetrix SNP 6.0 array), 450k methylation (gene level methylation values, and the probe with the most obvious negative correlation with the gene was selected as the methylation value of the gene, so that each gene has a methylation level value), and clinical and survival information in the TCGA database were obtained from the cBioportal   45 . The samples corresponding to RNA-seq, CNA, 450K methylation, and clinical survival information (OS and OS.time) were matched one by one, and the samples with these data were retained. As a result of these screenings, a total of 335 samples meeting the requirements were obtained; the clinical features of these samples are shown in Table 4.
Screening of prognostic molecular subtype. Based on the mRNA expression, 450k methylation, CNA, and survival information of the 335 samples, the FsbyCox function in the CancerSubtypes package 46 was employed to perform the univariate Cox regression analysis, and the prognostic-related characteristics of mRNA, methylation gene, and CNA region were acquired with the cutoff value of P < 0.05. Cluster analysis was then carried out using iClusterPlus 47 in R software package, and the parameter was set as K = 1:6 in order to select the best number of clusters. Combined with the sample survival information, the log-rank test was conducted, and the classification results with the lowest P value were selected to determine the molecular subtypes.
To verify the classification results, principal component analysis (PCA) and hierarchical clustering analysis were performed.
Tumor mutation analysis. The somatic mutation file processed using Mutect software was obtained from the TCGA database 48 . The oncoplot function in maftools 49 R package was employed to draw the waterfall of the top10 mutated genes with a high mutation frequency. The mutation frequency of each gene in different subtypes was calculated, and the differences were compared using the Chi-square test.
Differential expression analysis. The linear regression and empirical Bayesian methods offered in the limma package 50 in R software were utilized to conduct differential expression analysis, and the P values were adjusted using the Benjamini & Hochberg method for multiple comparisons. The DEmRNA and DElncRNA were screened with a cutoff value of P < 0.05 and |log 2 FC|> 0.263 owing to acquiring more DElncRNA for subsequent analysis.
Genomic copy number anomalies and their association to lncRNAs. The GISTIC 2.0 tool 51 was used to define CNA extracted from the TCGA-LIHC dataset with a cutoff value of gain /loss threshold > 0.1 and Q < 0.25 (When using GISTIC2 to the detect significantly gain or loss genomic regions in a group of samples, the integration of all results of gistic can be obtained, including gain and loss regions, and the samples of gain or loss in each region and the Q value in the peak region are acquired). Copy numbers ≥ 1 or ≤ −1 were considered gain and loss, respectively. The variant frequency of lncRNAs in samples was calculated, and the copy number gain and loss distribution of lncRNA in the genome were analyzed using the Rcircos tool 52 . Samples with lncRNA expression profiles were chosen, and Pearson correlation coefficients between CNA and lncRNA expression were analyzed. In addition, the differences of the expression of lncRNA with CNA frequency > 75% between normal, copy loss, and gain samples were compared using Kruskal-Wallis test.
Screening lncRNA prognostic markers with CNA and construction of lncRNA signature. First, the lncRNAs that met the following criteria were considered as candidate lncRNAs: CNA frequency > 5%, a significant positive correlation between CNA and expression (correlation coefficient > 0.3 and P < 0.05), and DE between different subtypes. Univariate Cox regression analysis was then performed to identify prognosticrelated lncRNAs using the Survminer package (P < 0.05) 53 . In addition, the samples in the TCGA database were categorized into training set (n = 234) and testing set (n = 101) based on 7:3, and the whole sample dataset was used as the validation set (n = 335). In the training set, the LASSO Cox regression analysis was performed using the glmnet package 54 , and a 20-fold cross-validation was utilized to build the lncRNA signature. The risk score was analyzed using the following formula: Risk score = β lncRNA1 × expr lncRNA1 + β lncRNA2 × expr lncRNA2 + … + β lncRNAn × expr lncRNAn (β represents the regression coefficient of lncRNAs, and expr represents the lncRNA expression level). The samples were then categorized into High_ and Low_risk groups based on the median risk score. Kaplan-Meier survival curve analysis was then conducted. In addition, the model was validated in the testing and validation sets. To verify the prognostic performance of the lncRNA signature, receiver operating characteristic (ROC) analysis was carried out in three sets using the survivalROC package 55 .

Development of the nomogram.
Nomogram is a method to display the results of the signature intuitively and effectively, and is conveniently applied in the prediction of the outcome. It uses the length of the line to represent the different variables, thereby exhibiting the effect of different variable values on the outcome. To test whether the Riskscore model was an independent prognostic factor, univariate and multivariate Cox www.nature.com/scientificreports/ regression analyses were carried out on RiskGroup and clinical characteristics, including AGE, AJCC_PATHO-LOGIC_TUMOR_STAGE, and GRADE, and the characteristics with P < 0.05, were used to build a nomogram. The nomogram was validated by assessing the discrimination and calibration. To be clear, the calibration curve of the nomogram was plotted to observe the nomogram prediction probabilities against the observed rates. statistically explored, and the differences were compared using the Chi-square test. In addition, the distribution of subtype samples between the High_ and Low_risk groups was evaluated.
Tumor immune microenvironment. Stromal, immune, estimated scores, and tumor purity were evaluated using the ESTIMATE algorithm 56 . In addition, CIBERSORT 57 was employed to evaluate the fractions of 22 tumor-infiltrating immune cells, and the differences in stromal score, immune score, estimate score, tumor purity, and fractions of 22 tumor-infiltrating immune cells were analyzed using the Wilcox test.
Drug sensitivity prediction. The Genomics of Drug Sensitivity in Cancer (GDSC) database 58 was utilized to assess the sensitivity of patients in High_ and Low_risk groups to chemotherapy drugs. The IC50 of 138 drugs was calculated using the pRRophetic algorithm 59 in R, and the differences were compared using the t-test.
Identification of enriched lncRNA modules by WGCNA. The WGCNA package 28 was employed to build a scale-free co-expression network based on the combined expression profiles of DEmRNAs and lncRNAs in the prognostic model, and the highly covarying gene set modules were identified. The mRNA in the same module serves as a potential lncRNA target gene. Enrichment analysis was conducted on the mRNA in the modules using clusterprofiler 60 , with the parameters of pAdjustMethod = "BH" and P < 0.05.